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Forecasting a time series from multivariate predictors constitutes a challenging problem, especially using 
model-free approaches. Most techniques, such as nearest-neighbor prediction, quickly suffer from the curse of 
dimensionality and overfitting for more than a few predictors which has limited their application mostly to the 
univariate case. Therefore, selection strategies are needed that harness the available information as efficiently 
as possible. Since often the right combination of predictors matters, ideally all subsets of possible predictors 
should be tested for their predictive power, but the exponentially growing number of combinations makes such 
an approach computationally prohibitive. Here a prediction scheme that overcomes this strong limitation is 
introduced utilizing a causal pre-selection step which drastically reduces the number of possible predictors to 
the most predictive set of causal drivers making a globally optimal search scheme tractable. The information- 
theoretic optimality is derived and practical selection criteria are discussed. As demonstrated for multivariate 
nonlinear stochastic delay processes, the optimal scheme can even be less computationally expensive than com¬ 
monly used sub-optimal schemes like forward selection. The method suggests a general framework to apply the 
optimal model-free approach to select variables and subsequently fit a model to further improve a prediction or 
learn statistical dependencies. The performance of this framework is illustrated on a climatological index of El 
Nino Southern Oscillation. 


I. INTRODUCTION 

Predicting the future behavior of complex systems from 
measured time series constitutes a major goal in many fields 
of science. Traditionally, this problem has been mainly ad¬ 
dressed in terms of model-based approaches, often using lin¬ 
ear models 111 . As an alternative, since the late 1980s also 
model-free predictions have been developed using nearest 
neighbors in state space (SHll or neural networks ISUTOl. In 
the nearest-neighbor technique, states similar to the present 
state are searched for in the past of a time series Y and a fu¬ 
ture value at a prediction step h is forecasted by simply 
averaging the Y values /i-steps ahead of the nearby past states 
or using local-linear models |2| Model-free predictions have 
so far been mostly univariate where states are usually recon¬ 
structed from embedding a single time series using Taken‘s 
theorem (31 [mini. However, the intertwined nature of com¬ 
plex systems calls for multivariate approaches taking into ac¬ 
count more available information. Now the problem is that the 
curse of dimensionality makes useful nearest-neighbor predic¬ 
tions almost impossible for more than a few predictors d, 
especially if many of these predictors carry redundant infor¬ 
mation. 

From an information-theoretic perspective, the minimal set 
of variables that maximizes the (multivariate) mutual infor¬ 
mation (H with a target variable is most predictive Ed. 
Minimality is required to avoid the curse of dimensionality. 
It is important to note that this set of variables can be dif¬ 
ferent from those with individually large mutual information 
with the target variable. Indeed, sometimes the right combi¬ 
nation of predictors matters. For example, if Y is driven mul- 
tiplicatively by X • Z, the mutual information of each of these 
predictors with Y can be very low and only the mutual infor¬ 
mation of the combined set (X, Z) with Y is very high. In 
general, such synergetic sets can only be detected by search¬ 
ing through all subsets of variables. However, the number of 


possible combinations for taking into account more variables 
and larger time lags grows exponentially making such a search 
strategy prohibitive due to computational constraints. 

Therefore, simple search strategies such as ranking the pre¬ 
dictors by their mutual information with a target variable or 
the CMI-forward selection using conditional mutual infor¬ 
mation (CMI) have been proposed recently M- Here we 
demonstrate that such approaches can fail already in simple 
cases where one cannot avoid to test different subsets of pre¬ 
dictors. However, we information-theoretically prove that the 
search can be restricted to causal drivers. To obtain these 
drivers, we exploit a recently developed model-free algorithm 
to consistently reconstruct causal drivers from multivariate 
time series that keeps the estimation dimension as low as pos¬ 
sible with typically low computational complexity ifTTl . The 
much smaller set of causal drivers then allows for a globally 
optimizing search strategy which is optimal also for synergetic 
cases. In this contribution, we additionally provide a practical 
criterion for selecting the optimal size of the subset of predic¬ 
tors which compares well even with computationally expen¬ 
sive cross-validation approaches. In numerical experiments 
we found that our optimal scheme yields much improved pre¬ 
dictions and often even runs faster than forward selection. 

Our method suggests a general framework not only for pre¬ 
diction, but also for general statistical inference problems for 
datasets (not only time series) where the underlying mecha¬ 
nisms are poorly understood: Firstly, the optimal model-free 
approach can be applied for selecting not only causal, but also 
possibly synergetic driving variables and, secondly, these vari¬ 
ables can be used to fit a model to learn the functional form 
of the dependencies on these causal predictors. This approach 
combines the advantage of a model-free approach to detect 
relevant variables with the advantage of model-based meth¬ 
ods to efficiently harness these variables to further improve 
predictions or understand mechanisms. Our framework is il¬ 
lustrated on a sea-surface temperature based index of the El 
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Nino Southern Oscillation (ENSO) in the tropical Pacific. 

This paper is organized as follows. After deriving the 
optimality of causal predictors in Sect. we discuss com¬ 
mon approaches for information-theoretic variable selection 
for predictions in Sect. The optimal scheme is explained 
in Sect. including the causal pre-selection algorithm and 
selection criteria. Section|V]discusses the computational com¬ 
plexity of the different schemes. In Sect. |V^ we compare our 
scheme with other approaches in a model example. Exten¬ 
sive numerical experiments on multivariate nonlinear stochas¬ 


tic delay processes are conducted in Sect. VII In Sect. VIII 


we analyze the combination of the model-free selection with 


a model-based prediction scheme which is applied in Sect. IX 
to predict future values of the considered ENSO index. 


II. OPTIMAL PREDICTION 


The discrete-time evolution of a subprocess V of a multi¬ 
variate A^-dimensional stochastic process X can be described 
by 


i"t+i = / r/f+i) , (1) 

with some function /(•), where Vvt^i C = 

(Xt, Xt_i, ...) are the deterministically driving variables in¬ 
cluding the past of Y at possibly different time lags and 
represents stochastic noise driving Y. The uncertainty about 
the outcome of Vt+i can be quantified by the Shannon entropy 
El which decomposes into 


F(Ft+i) = /(Ft+i; X-+i) + \ X'+J , (2) 

where the latter term is the source entropy |[T8l[T3|. This con¬ 
ditional entropy quantifies the minimum level of uncertainty 
that cannot be predicted even if the whole past (and present) 
X^^ is known. If the dependency of / on the noise term 
is additive, the source entropy equals the entropy of the noise: 
H{Yt+i I The infinite-dimensional mul¬ 

tivariate mutual information (MMI) /(It+i ; x,-+i), on the 
other hand, quantifies the predictable part by measuring by 
how much the uncertainty about the outcome of can be 
maximally reduced if X^^ was perfectly measured. 

In practice, a prediction using the entire set X^^ (truncated 
at some maximal lag) would severely suffer from the curse of 
dimensionality and overfitting El, which means that many 
variables do not actually carry useful information, but merely 
fit the noise in the time series, and the prediction - trained on 
a learning set - would perform poorly on a test set. The goal 
is, thus, to select a small subset of predictors from X^^ that 
carries a maximum of information about Vt+i and still gener¬ 
alizes well on new data. However, for an A^-variate process 
X truncated to a maximum delay Tmax the number of possible 
subsets grows exponentially in N and Tmax- 

To avoid this combinatorial explosion, simple search strate¬ 
gies such as ranking the individual predictors by their mutual 
information (MI-scheme) with the target variable can be used 
which, however, is prone to include redundant variables that 
do not improve a prediction. Eorward selection, a more ad¬ 
vanced technique, iteratively determines predictors based on 
how much information they contain additionally to the already 


chosen variables using conditional mutual information (CMI- 
scheme) El leading to a polynomial computational complex¬ 
ity. These strategies will be discussed in Sect. |I^ But for¬ 
ward selection is not a globally optimal strategy, one reason 
being that it might select variables that are not deterministi¬ 
cally driving It+i, the other that it fails to detect synergetic 
cases as demonstrated in our model example in Sect. fVl 

The unknown deterministic drivers in Eq.^) are, 

however, key to arrive at optimal predictors as can be mown 
by decomposing the MMI in Eq. Q using the chain rule ifTHl 

/(Ft+i;X-+i)=/(yt+i;Py,^J + 

+ (3) 

V _ ^ 

=0 

where \ denotes the set subtraction. The second term is zero 
for processes satisfying the Markov property which states that 
Ft+i is independent of the remaining past given its causal par¬ 
ents VvtJ^ n term that originates from the theory of graphical 
models 1^ . Eor multivariate time series these are called time 
series graphs |[l2l|2ll|22l. This proves that, theoretically, all 
information is already contained in the causal parents. Adding 
a variable from X^^ would not increase the information, but 
removing a parent from Vyt+i would decrease the MMI in 
Eq. The causal parents can be efficiently estimated by the 
algorithm described in Sect. |IV A| The Markovity rests on the 
assumption that the noise term driving Y is independent 
of the noise terms driving the other variables. While this as¬ 
sumption is not strictly fulfilled in many real world systems, 
it often at least approximately holds. This assumption is also 
not as crucial for the prediction task as it is for the causal in¬ 
ference problem. 

However, for finite time series, some predictors in 
even though causal, could only be weakly driving and lead to 
overfitting since they do not generalize well on new data. It is, 
therefore, crucial to optimize the selection of a minimal sub¬ 


set of causal parents. Since the set of causal parents Vvt+i is 
much smaller than XT i, this can now be done using a global 


optimization strategy. In Sect. IV B we present such a strategy 
to select the optimal subset ^ of causal predic¬ 

tors. 

Eor predictions h> 1 steps into the future, the set of causal 
predictors VvtJ^h is not identical with the parents any¬ 

more as for h = 1 because predictors can only be chosen 
among the observed variables X^^ = (X^, X^-i, ...) prior 
to t + 1. Still the same algorithm as for the causal parents can 
be used to obtain the set of variables that separates f^om 
X^^ \ VYt+h i^ series graph. In Eig.[^i) an exam¬ 

ple of such a graph is given. As defined in ifTTll . each node in 
that graph represents a subprocess of a multivariate discrete 
time process X at a certain time t. Nodes are connected by 
a directed link if they are not independent conditionally on 
the past of the whole process, which implies a lag-specific 
Granger causality with respect to xED. Using these causal 
predictors, the only uncertainty left comes from the source en¬ 
tropy of Ytyh and the entropy from the unobserved ancestors 
of Ytyh between t+1 and tYh—1 (see Eig.[2i)). 

In the following sections we discuss and numerically com¬ 
pare the four prediction schemes mentioned above: (1) MI 
selection, (2) CMI-forward selection, (3) CMI-forward selec¬ 
tion of only causal predictors, and (4) our optimal scheme. 
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(i) Pre-select predictors with causal algorithm 



(ii) Estimate for all subsets of 



p = 2 p = '^ 





Figure 1. (Color online) Optimal prediction scheme, (i) Causal pre¬ 
selection in an example time series graph (see text) for predicting 
Yt+ 2 - Even though Z could have a high mutual information with 
Y, its influence is only indirect through the parents Xt and Yt+i 
of Yt+ 2 - Among these the latter already lies in the unobserved fu¬ 
ture, but part of its information can still be recovered by measur¬ 
ing Yt and Xt-i which share information along the paths marked 
with black arrows. These variables form the causal predictors 
(blue boxes) for h — 2, which can be found by determining the 
Markov set. A suitable algorithm for this task will be discussed in 
Sect. |IV A| All paths from nodes further in the past to YtY 2 have to 
pass through this set. (ii) Selection of optimal subset from causal pre¬ 
dictors. For all numbers of predictors p, the multivariate mutual in¬ 
formation ; YtYh) of all possible subsets is estimated, (hi) 

The optimal predictors are those where the estimated MMI takes its 
maximum or can be obtained from cross-validation. For example, 
if the optimal predictors were (Ft, Xt-i), only time series samples 
(blue shaded) of these predictors are used with a nearest-neighbor 
(Sec. IV C) or model-based (Sec. VIII) prediction scheme to fore¬ 
cast the future value t-i-h of the time series of Y. 


III. MI AND CMI PREDICTION SCHEMES 


one can either apply a heuristic criterion or cross-validation 
ca In the model experiments in Sects. [V^ and |VII| we 
evaluate both approaches, employing the heuristic criterion 
that the MI of the ranked predictor should be at least 
a fraction A of the MI of the previous predictor i.e., 

I{X(p)-Yt+h) > Awith A e [0,1). The 
last of the ranked predictors that satisfies this criterion deter¬ 
mines the selected number p of predictors. This scheme has 
the drawback that two or more predictors with high MI with 
the target variable might contain highly redundant informa¬ 
tion. Then the inclusion of redundant predictors leads to over¬ 
fitting which will be discussed in Sect. [Vl| 

The second prediction scheme, CMI-forward selection, 
overcomes this drawback by excluding information already 
contained in the previous predictors CSl: First the Mis 
I{Xt-r] ^t+h) for all Xt-T ^ are estimated. The first 

predictor X^^^ is the one that maximizes the MI with the tar¬ 
get variable (i.e., the same one as in the MI-selection scheme). 
The next predictor X^‘^\ however, is chosen according to the 
maximal CMI I{Xt_^; Yt_^h\X^^^) among all remaining pre¬ 
dictors, the third predictor is the maximum CMI conditional 
on the two previously selected predictors, etc. In each step p, 
the CMI gives the gain to the MMI if this predictor is included 
because 


"-V-" 

MMI without X (p) 

+ Ft+ftKXd),..., X(P-i))). (4) 

-V-" 

gain due to X (p) 

Here the heuristic criterion ca is to select as the best p the 
last p with at least a gain of 

I{X^P^-,Yt+h\X^\...,X^P-^^)) 

> (5) 

where A G [0,1) as before. In |[T^ also an adaptive choice 
of p using a shuffle test is discussed. This scheme has been 
proposed to infer causal drivers in GSI. However, it can be 
shown to fail for this task already in simple cases which will 
be shown in Sect.lvll 

Rather than with a heuristic criterion, at the cost of addi¬ 
tional computation time, the best number p of predictors can 
also be chosen by cross-validation. Here we use an m-fold 
cross-validation where the available observed set of time in¬ 
dices T = {l,...,T}is partitioned into m complementary 
segments. For each validation round, a fold m is retained as 
the validation set Tm on which the prediction performance is 
evaluated. The nearest-neighbors are searched for in the com¬ 
plementary set T\Tm from which also the prediction estimate 
is generated. Then the number p of predictors where the aver¬ 
age prediction error across all m folds is minimal is chosen. 


In the first prediction scheme, Ml-selection, the respec¬ 
tive MI of each variable in np to a maximum lag Tmax 
with the target variable is estimated. Then the predic¬ 
tors X^’^ G nre ranked by their MI: I{X^^^;Yt-yh) > 

I{X^‘^^;Yt^h) > > • • •• To determine the 

best number p of the ranked predictors that should be used, 


IV. OPTIMAL PREDICTION SCHEME 

For the prediction of given the multivariate time se¬ 
ries X, our proposed optimal prediction scheme consists of 
the following steps (Fig. [^: (i) Estimate the causal predic¬ 
tors Vvt+h ^ nsing the causal algorithm described in 
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the next Sect. IV A (ii) check all subsets (except the empty 


one) and select the p causal predictors 


(p) 


Q with 


the maximal estimate of the MMI I{Vy'’ ; Yt+h) with the 




target variable as the optimal ones (Sect. |IV B] ), (hi) use these 
predictors to forecast the target variable with nearest-neighbor 
prediction (Sect. |IV C| ). In the following, we explain the causal 
pre-selection algorithm and discuss criteria for selecting the 
optimal subset. While here the actual prediction is conducted 
using a nearest-neighbor scheme (21, in Sect. | VIII we will also 
discuss how a model-based prediction based on the inferred 
optimal predictors can further improve a forecast. 


A. Causal pre-selection algorithm 

The causal pre-selection algorithm is a modification of the 
algorithm introduced in CD, which is based on the PC al¬ 
gorithm (23l [24l (named after its inventors Peter Spirtes and 
Clark Glymour). The main idea is to iteratively unveil the 
causal predictors by testing for independence between pairs 
of nodes in the time series graph conditional on a subset of 
the remaining nodes. Since these conditions are efficiently 
chosen, the dimension stays as low as possible in every iter¬ 
ation step. This important feature helps to alleviate the curse 
of dimensionality in estimating CMIs mil affecting the com¬ 
putation time as well as the reliability of conditional indepen¬ 
dence tests. Under some mild assumptions discussed below, 
the algorithm yields a consistent estimate of Vvt^h • Instead of 
the commonly used binning estimators where the curse of di¬ 
mensionality is especially severe, here we utilize an advanced 
nearest-neighbor estimator ||25]| that is most suitable for vari¬ 
ables taking on a continuous range of values. This estima¬ 
tor has as a free parameter the number of nearest neighbors k 
which determines the size of hyper-cubes around each (high¬ 
dimensional) sample point. Small values of k lead to a lower 
estimation bias but higher variance and vice versa. There¬ 
fore, we choose different values in the algorithm and 

the subsequent selection schemes (/^cmi/mmi^ that for 
an estimation from (multivariate) time series stationarity is re¬ 
quired. 

The algorithm starts with no a priori knowledge about the 
drivers and iteratively learns the set of predictors of Y : First, 
estimate unconditional dependencies ^t+h) and ini¬ 
tialize the preliminary predictors Vvt+h — {^t-r ^ • 

I{Xt-r] yt-\-h) > 0}. This set contains also non-causal pre¬ 
dictors which are now iteratively removed by testing whether 
the dependence between and each Xt-r ^ 'PYt+h condi¬ 
tioned on the incrementally increased subset Pyl+h — 
vanishes: 

(n.) Iterate n over increasing number of conditions, starting 
with some no > 0: 

(n.i) Iterate i through all combinations of picking n 
nodes from ,, to define the conditions 

in this step, and estimate /(X^-r; Yt+h \ Py^h) 
for all Xt-r ^ Pvt+h- After each step the nodes 
Xt-r with I{Xt-r;Ytyh\pYllJ = 0 ^re re¬ 
moved from P+t+h iteration over i stops if 

all possible combinations have been tested. (In the 


implementation we limit the number n^ of combi¬ 
nations and check relevant combinations first, see 
(TTllJbl for details.) 

If the cardinality \PYt+h I — algorithm converges, 
else, increase n by one and iterate again. (In the im¬ 
plementation we limit the dimensionality up to some 
^max- If the initial number of conditions is no > 1 to 
speed up the algorithm, also previously skipped com¬ 
binations with n < \PYt+h I ^ced to be checked before 
convergence can be assessed.) 

The main assumptions underlying the identification of the 
conditional independence structure with the PC algorithm are 
the Causal Markov Condition, i.e., Markovity of the process 
of any finite order, and Faithfulness, which guarantees that 
the graph entails all conditional independence relations true 
for the underlying process and can be violated only in cer¬ 
tain rather pathological cases (241 . If these assumptions are 
fulfilled, the causal algorithm is universally consistent, imply¬ 
ing that the algorithm will converge to the true causal predic¬ 
tors with probability 1 for infinite sample size (24l . On the 
other hand, the CMI forward selection algorithm proposed for 
causal inference in ca is not consistent since it yields non- 
causal drivers already in simple cases which will be analyzed 
in Sect.lyll But the CMI forward selection scheme can be ‘re¬ 
paired’ by a further backward elimination step l27l . 

Practically, the causal algorithm involves conditional in¬ 
dependence tests for I{Xt-r]Ytyh\PYXy) — 0. In 

da a shuffle test is proposed for testing whether 
I{Xt-r]YtYh > 0- An ensemble of M values of 

I{Xt_^; Ytyh I Pyt+h) where Xt_^ is a shuffled 

sample of Xt-r, i-e., with the indices permuted. Then the 
CMI values are sorted and for a test at a given a-level, the 
aM-th value is taken as a significance threshold. In CD a nu¬ 
merical study on the detection and false positive rates of the 
algorithm are given. The shuffle test comes at the additional 
cost, that for each conditional independence test M surrogates 
of CMI have to be estimated. An alternative is to apply a fixed 
threshold /*, which has the drawback that it does not adapt to 
the negative bias for higher-dimensional CMIs (26l|28l. The 
algorithm yields different numbers of predictors for different 
chosen fixed thresholds or significance levels and the value 
should be low enough to include weak but possibly syner¬ 
getic causal predictors, but high enough to limit computational 
complexity in the optimization step (Sect. 0 . 


B. Optimal selection criteria 

Once the causal predictors are determined, the optimal sub¬ 
set needs to be chosen (possibly containing all causal predic¬ 
tors). Here we also discuss a scheme using forward selection 
on the causal drivers (causal CMI-scheme). 

For the third prediction scheme, causal CMI-selection, the 
forward selection ranking discussed in Sect. |I^ is applied not 
to the entire set but only to the pre-selected causal pre¬ 

dictors PYt+h • Also here, the same heuristic criterion as for 
the non-causal forward selection or cross-validation can be 
used. 

For the optimal scheme (Fig. [^ii)), the MMI 
HPt+h'’>^tYh) for all subsets of the causal predictors 
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Figure 2. (Color online) Multivariate mutual information (MMI, bot¬ 
tom red bars) and standardized root mean squared prediction error 
(value proportional to lower end of grey bars at top) for all subsets 
of causal predictors for model ( p^ , for details see Sect. |vi] For each 
number of predictors p, the number of possible combinations varies 
according to the binomial coefficient (^). The predictor combina¬ 
tions are sorted by their prediction error. The maximum of MMI and 
also smaller values match the minimum prediction error very well. 
Note that MMI is estimated on the learning set while the prediction 
error is evaluated out-of-sample on the test set indicating that the bias 
of higher-dimensional MMIs here serves as a good proxy for overfit¬ 
ting as discussed in the text. 


from Vvt^h estimated. In Fig.l^the MMI values (ensemble 
average) are plotted for the mod^ example discussed later in 
Sect. VI with \VYt+h\ — ^ causal predictors. Even though 
all seven predictors are causal drivers, the estimated MMI is 
highest for just three of them and even decreases for more 
predictors (according to Eq. the theoretical MMI should 
be maximal for seven predictors). The reason is that the 
estimated MMI is negatively biased for higher dimensions 
1261 |28l |29l and if the additional information contained in the 
predictors does not outweigh this bias, the MMI decreases. 
The bias of the MMI estimator, therefore, implies a penalty 
that avoids overfitting. In our heuristic criterion we exploit 
this property and simply select the subset C Vt+h with 

maximal estimated This model-free data- 

based criterion could be seen as an analogue to model-based 
criteria like the Akaike Information Criterion (AIC) ISOl 
where the penalty is derived from some measure of model 
complexity, but our criterion needs to be further investigated. 
We choose as the nearest-neighbor parameter in the MMI es¬ 
timator = k, where k is the nearest-neighbor parameter 
in the prediction (see next Sect.|IVC). 


In our numerical experiments in Sect. VII we addition¬ 
ally test a combination of the MMI-criterion with cross- 
validation: Eor each number of predictors p we check the 
MMI-criterion to select the optimal combination and then use 
cross-validation to pick the optimal p. This approach gives 
always a slightly better prediction performance than using the 
maximum criterion alone - at the cost of much longer compu¬ 
tation time. Asymptotically, for model-based predictions the 
AIC criterion is equivalent to cross-validation and in our nu¬ 
merical experiments (see appendix Eig. we also find that 
our heuristic MMI-criterion well matches the cross-validation 
choice for longer time series. Note that our use of cross- 
validation treats p as a tuning parameter as is typically done 


ca. One could also treat the choice of a subset as a tun¬ 
ing parameter and run steps (i)-(iii) of the prediction scheme 

(v) 

for every fold. is, however, not a numeric Tuning pa¬ 

rameter’ and different folds in the cross-validation might not 
contain the same subsets. As a result the variance across the 
folds is considerable and it is hard to find the subset with min¬ 
imal cross-validation error. 


C. Nearest-neighbor prediction 


Once the optimal predictors are selected, the actual predic¬ 
tion is conducted here using a scheme with a fixed number 
of nearest neighbors k HI: Eor the optimal set of predictors 

(v) 

determine the distances 

dt,s = W'Pt+h ~ II fo’’ all s e T with S > Tmax + h, 

( 6 ) 

where || • || denotes some norm. Here we apply the maxi¬ 
mum norm as in the nearest-neighbor estimator of the (con¬ 
ditional) mutual information 1^ . r^iax is the maximum lag 
used to estimate Next, we sort the distances in increas¬ 
ing order dt^si < ' yielding an index sequence 

si, 52,_ Now there are two approaches to use these dis¬ 

tances: (i) A fixed distance e is chosen and all points s with 
distance smaller than 5 are taken into account to predict 
Then the coarse-graining level is consistent for all sample 
points, but sometimes there might not be any point within a 
distance 5 Elia. (ii) Here we use a fixed number of nearest 
neighbors which has the advantage that the same number of 
points contributes to a prediction making the estimate more 
reliable. Eor a chosen fixed number of nearest neighbors k the 
future value is then estimated by the conditional expec¬ 
tation and its prediction interval by its standard deviation: 




t-\-h 


k 


SCli+k) = 


i=i 


\ 


k 


i=i 


(7) 


Another option, instead of the expectation, is to fit an autore¬ 
gressive model giving a local-linear prediction O. The sum¬ 
mation can also be weighted with a function of the distance of 
the neighbors, different norms, or kernel-based methods can 
be chosen Ha. 

Eor the number of nearest neighbors k, we use k = 
^CMi/MMi nearest-neighbor parame¬ 

ter in the estimation of CMI or MMI in the selection schemes. 
This choice approximately yields a consistent level of coarse- 
graining in the information-theoretic selection step and the 
nearest-neighbor prediction. Alternatively, at the cost of com¬ 
putation power, one can also utilize cross-validation for this 
choice ca. The number of nearest neighbors needs to be 
balanced to guarantee that only nearby values are used as pre¬ 
dictors, but still enough values are available to confidently es¬ 
timate and possibly the prediction interval. The value 
will typically strongly depend on the data. As a skill metric 
we compute the standardized root mean squared error 


SRMSE = 


1 




( 8 ) 
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No. of causal predictors 


Figure 3. (Color online) Scaling of computational complexity with 
the cardinality of the estimated causal predictors | for Tmax = 

2. Solid lines will be used for N = 10, whereas dashed lines denote 
= 30. The curve for the causal CMI-selection scheme (black) is 
only slightly different from the constant curve for the MI-selection 
scheme (blue). The causal schemes have the advantage that their 
complexity rather depends on how many causal drivers are found, 
while the complexity of the MI- and CMI-selection schemes depends 
on the number of processes and the maximum lag. For the non-causal 
CMI-selection scheme (grey lines) we fixed the maximum number of 
predictors to pmax = 10. To include the complexity due to the prese¬ 
lection step in the causal schemes we added 420 (1260) for iV = 10 
(N = 30) for these schemes according to Eq. 0 for no = 2 and 
ni = 3 as in our numerical experiments (see Sect. Eji}. The plot 
shows that if the number of causal predictors can be reduced below 8 
(10) here, the optimal scheme even takes less computation time than 
the non-causal CMI-selection scheme. 


where ay is the variance of Y in the testing set Ttest • 


V. COMPUTATIONAL COMPLEXITY 


An important criterion for the practical applicability of the 

is 


different prediction schemes discussed in Sects. Ill and IV 


their respective computational complexity. The estimator for 
the (C)MI I{X;Y\Z) employed here (nearest-neighbor tech¬ 
nique with maximum norm 1^ ) has a computational com¬ 
plexity of 0{T‘^{Dx + Dy + Dz)) im. where T is the time 
series length and D the dimensionality of the respective vari¬ 
able. Fast neighbor searching algorithms can further reduce 
the dependency on T, but here we are interested in the relative 
complexity of the different predictor selection schemes and, 
therefore, only consider the linear scaling with the number 
of dimensions. In this case the first prediction scheme, MI- 
selection, clearly is the cheapest option. For a A^-dimensional 
process X, this procedure involves just A^(rniax +1) estimates 
of Mis with a dimensionality of Dx = 1, Dy = 1, Dz = 0. 


The second scheme, CMI-forward selection, is more de¬ 
manding the more possible predictors are included. For cross- 
validation, a maximum number of Pmax predictors has to 
be selected, increasing the dimension to maximally Dx = 
1, Dy = 1, Dz = Pmax — 1 dtie to more conditions in Eq. Q. 


The CMI-forward selection then involves 

Pmax 1 

E] (-^('^max + 1) - P) 

p=0 

Pmax -|- Tj^ax)) 

estimates of CMI with iteratively increasing dimensionality. 
Then the complexity of the CMI-selection scheme scales as 

Pmax 1 

(A^(rmax Y 1) — p){p Y 2) 

p=0 

— gPmax (b 3pniax ^Pmax Y 3N(3 Y Pmax)(l Y 'T'max)) 

for pmax < ^('^max + 1) + 1, i.o., with a high linear depen¬ 
dency on A^(rniax +1) and a polynomial dependency ^ Pmax- 
Note that for a m-fold cross-validation step using nearest- 
neighbor prediction an additional computational complexity 
of 


rji Pmax 

m ■ T— (1 + p) = -T^Pmax(3 + Pmax) 
m 2 

p=i 

has to be added. 

In Fig. we compare the complexity of the different predic¬ 
tion schemes for = 10 and Tmax = 2 as in our numerical 
experiments in Sect. |VII| While one can fix Pmax to a small 
number for which nearest-neighbor predictions yield accept¬ 
able results, the main problem here is that the computation 
time quickly increases with A^(rmax + 1) (linear, but with a 
large pre-factor). One advantage of a causal pre-selection step 
is to reduce this number before the computationally expensive 
selection procedure is invoked. 

Since typically the set of causal predictors is small, i.e., 

I ^ ^(^max + 1), the third prediction scheme, causal 
CMI-selection, has a drastically smaller computational com¬ 
plexity than the non-causal CMI-selection scheme if the ad¬ 
ditional complexity due to the causal algorithm is not that 
large. If we rank all causal predictors (not limiting to some 
Pmax as in the non-causal scheme), the complexity scales as 
+ where denotes the 

number of causal predictors. In Fig.|^ the complexity (black 
lines) shows only a very moderate increase with \Vy^_^^\. 

For the optimal scheme the computational complexity 
grows exponentially as 2 \^n+h\ I) — 1. However, 

Fig. shows that if the number of causal predictors can be re¬ 
stricted, the optimal scheme even takes less computation time 
than the non-causal CMI-selection scheme. The number of 
causal predictors can be reduced by adjusting the significance 
level a or fixed threshold /* in the conditional independence 
tests of the causal pre-selection algorithm. Most important, 
the causal scheme’s complexity only scales with the number 
of causal drivers and not directly with the number of processes 
N or the maximal lag r^ax as the non-causal schemes (dashed 
lines in Fig.[^. The dependence of the causal schemes on N 
and the maximal lag r^ax is only via the algorithm. 

The additional time complexity of the causal algorithm 
varies with the graph structure. The number of iterations can 
be limited by starting with a higher number of initial con¬ 
ditions no and limiting the maximum dimensionality nmax- 
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This number determines up to which dimension of the 

conditional independence is checked. Also the number of 
combinations ui in the i-loop can be restricted. In a worst case 
scenario where the spurious links only vanish if the maximum 
number of conditions is used, the computational complexity 
scales as 

/ "^max 1 

^(^max + 1) ( ^ ^ ^ 

\ n=no i=0 

A^('7”max “1“ 1) “1“ ^max ^o)(4 '^max “1“ • 

However, for sparse graphs and the conditional set being ef¬ 
ficiently chosen El, typically links get removed already for 
an no-dimensional condition with a complexity of 

^('Tmax + 1)(2 + ni(2 + no)), (9) 


and rii = 1 or 2. This is also confirmed in numerical ex¬ 
periments in ifTTIl and Sect. VH Often the complexity is even 
lower because the MI value in the first iteration is already non¬ 
significant. 


VI. MODEL EXAMPLE 

The following nonlinear discrete-time stochastic delay 
equation provides an illustrative example where a simple MI- 
or CMI-forward selection procedure yields non-causal vari¬ 
ables that deteriorate a prediction. Consider 

y*+i = c ^ + b n + ar/f+i 

Xi") = a (W^l\ + (10) 

where the causal drivers W^'\ Z^'\ and r]' are indepen¬ 
dent Gaussian processes with zero mean and unit variance 
^"(0,1)]. This model illustrates why the MI and CMI predic¬ 
tion schemes fail to yield good predictions due to (1) selecting 
non-causal predictors and (2) missing out synergetic predic¬ 
tors. Here the Z^'^i are synergetic causal drivers, which - for 
certain parameters 6, c - are individually less predictive than 
the drivers W^ 'ly But selected all together, their combined in¬ 
formation / ^i-i^ ^t-\) 5 is much larger than 

the single mutual informations. This synergetic effect can 
only be detected if all subsets of causal drivers are tested in 
a globally optimal scheme. In the following we analyze why 
the MI- and CMI-selection schemes fail to provide good pre¬ 
dictions for such cases. 

Regarding the problem of selecting non-causal drivers, for 
certain parameter combinations of (a, 6, c) the mutual infor¬ 
mation ; It+i) is larger than any ; It+i) or 

I{z['\ ; It+i) for all i. The non-causal schemes based on it¬ 
eratively selecting predictors with maximal MI or CMI (blue 
and grey box plots in Fig. will, therefore, choose a non- 
causal prior to the true causal predictors and Z^\. 
Since the predictors W^'^i have the largest MI after the xj'\ 



Figure 4. (Color online) Comparison of prediction schemes for an 
ensemble of 500 realizations of model for a = 0.4, h — 2, c — 
0.4, cr = 0.5 with time series graph given in the inset (learning 
set length 500, test set length 125, nearest-neighbor prediction with 
k = 10 neighbors, (C)MIs estimated from the learning set with pa- 
rameter = 50 in the algorithm and fcCMi/MMi ^ 

optimization). The box and whiskers plots give the ensemble median 
and the interquartile range of the standardized root mean squared 
prediction errors in the test set for each iteration step p in the four 
schemes (from left to right: MI, CMI, causal-CMI, optimal scheme). 
The black line gives the median of the true minimal prediction error 
obtained by minimizing the out-of-sample prediction error for each 
number of predictors p taken from the true causal drivers. For p = 3, 
only the optimal scheme (red) selects the best (synergetic) predictors 
Z [^-^, Z [^\, Z[^-^ and reaches the minimum possible error while the 
causal forward selection (black) first picks one of the less predictive 
W^'2i and the pure forward selection (grey) and Ml-based schemes 
first pick the two non-causal drivers The non-optimal 

schemes include the synergetic predictors only when the higher di¬ 
mensionality is already worsening the prediction due to overfitting. 


these are included next in the Ml-selection scheme. In the 
CMI-forward selection scheme, on the other hand, the syn¬ 
ergetic variables Z^\ are selected after the second iteration 
step. This leads to a drastic decrease in the prediction error at 
p = 5 (grey box plot in Fig. [^. The problem is that now the 
dimension of the predictors is already 5 and the two spuriously 
causal variables xj'^ deteriorate the prediction. 

The causal pre-selection avoids this pitfall. The black and 
red box plots in Fig. denote the schemes based on causal 
predictors using forward selection (black) and the optimal 
scheme (red). Also for causal drivers the forward selection 
scheme is not optimal because the selection of one predic¬ 
tor at a time fails for synergetic cases and selects the weak 
drivers prior to the synergetic drivers Z^'\ Only the op¬ 
timal scheme correctly identifies these drivers for the dimen¬ 
sion p = 3 and reaches the minimal prediction error possible 
for this model (black line) for each p. In Fig. we show that 
the three synergetic drivers Z^'^ are chosen with our heuris¬ 
tic optimal criterion since they have the largest MMI with the 
target variable. 
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Figure 5. (Color online) Results of numerical experiments for an 
ensemble of 500 trials of the synergetic model class (H) with time 
series length T = 500 (125 in the test set) for the four different pre¬ 
diction schemes (from left to right in the panels: MI in blue, CMI 
in gray, causal-CMI in black, optimal in red), (a) The four box plots 
show the ensemble interquartile range of (i) computational complex¬ 
ity (the green box on the right shows the added complexity due to 
the causal algorithm), (ii) the range of numbers of predictors p se¬ 
lected by cross-validation (CV, here the green bar shows the number 
of causal predictors | in the pre-selection step), (iii) the true 

positive rate (TPR) and (iv) false discovery rate (FDR). The latter are 
evaluated for each scheme at p = 8, corresponding to the true num¬ 
ber of causal drivers (or less if fewer causal drivers are estimated in 
the pre-selection step), (b) Box plots showing the median and in¬ 
terquartile range of the prediction error relative to the true minimal 
error obtained by minimizing the out-of-sample prediction error over 
all subsets of true causal drivers. On the left are the results if cross- 
validation is used to optimize p for each scheme (whiskers show the 
5% and 95% quantiles). The red box in the center shows the result 
for the heuristic optimality criterion. The range of boxes on the right 
shows the results for different thresholds A for the other schemes 
(only interquartile range), (c) Box and whiskers plots (showing the 
5% and 95% quantiles) for the absolute prediction error of the opti¬ 
mal scheme at the cross-validated choice of p for different numbers 
of nearest neighbors k. 


VII. NUMERICAL EXPERIMENTS 


We next compare the four schemes including the causal 
pre-selection algorithm on a larger class of synergetic nonlin¬ 
ear discrete-time stochastic processes with different coupling 


Figure 6. (Color online) Numerical experiments on a class of non- 
synergetic, but still nonlinearly coupled generalized additive models 
as analyzed in the supplement of Ref. (nl. Parameters as in Fig.|^ 
but with iV = 4 processes and pmax = 6. The orange box plot in (b) 
shows the relative prediction error if the optimal predictors from the 
model-free selection scheme are used in conjunction with the linear 
auto-regressive prediction model © (only the interquartile range 
shown). 

configurations: 

(xr+i) + 4^1 for j = 2,..., TV, (11) 

where we are interested in predicting (i.e., h = 1). The 
linear function is simply the sum of 5 randomly chosen 

subprocesses (excluding process X^^^) at random lags 

0 < r < 2. The nonlinear function on the other hand, 
is the product of 3 randomly chosen subprocesses (excluding 
process X^^^ and the ones already included in the linear term). 
The other subprocesses for j > I are linearly driven by 2 
other randomly chosen subprocesses, also at random lags. The 
coefficients are fixed to a = 0.4, 6 = 2, c = 0.4, and N = 10. 
With this setup we generate an ensemble of 500 realizations. 

We run the four schemes at different choices of the heuris¬ 
tic parameter A and using cross-validation checking up to 
Pmax = 8 predictors in the non-causal schemes. For the causal 
schemes, we use cross-validation for all estimated causal pre¬ 
dictors (up to maximally Pmax = 8, ranked by their CMI value 
in the algorithm ifTTl ). The causal drivers are estimated using 
the algorithm parameters no = 2,nmax = 3, n^ = 3, and 
^max = 2 with a fixed significance threshold /* = 0.004 
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(analyses for other thresholds are shown in the appendix). 

In Fig. we show various statistics comparing the com¬ 
putational complexities and the prediction errors. Here the 
number of possible predictors is N{Tmax + 1) =30 yielding a 
computational complexity of 60 for the Ml-selection scheme 
(blue) and 1124 for the CMI-selection scheme (grey) using 
Pmax = 8. The causal algorithm reduces the number of possi¬ 
ble predictors to about 7 (median). This corresponds to a true 
positive rate (TPR) of roughly 0.9 (there are 8 true drivers in 
model ( pT| ), but several are only weakly driving) and a zero 
false discovery rate (FDR), while the MI- and CMI-selection 
schemes detect fewer causal drivers and much more false pos¬ 
itives. Fewer predictors result in a lower computational com¬ 
plexity for the causal prediction schemes. The causal CMI- 
selection scheme runs extremely fast (black) and the com¬ 
plexity of the optimal scheme (red) strongly varies among the 
different realizations, since it depends exponentially on how 
many causal predictors are pre-selected (Fig.|^, but still typi¬ 
cally even stays below the non-causal CMI-selection scheme. 
Using cross-validation, the Ml-selection scheme uses typi¬ 
cally (median) p = 5, the CMI-selection schemes bothp = 4, 
and the optimal scheme only p = 3 out of the 8 true causal 
drivers for this model. 

Finally, the relative prediction errors show that only the op¬ 
timal scheme reaches the lowest possible errors with a median 
of zero relative error and even 90% of the ensemble below an 
error of 0.1. This demonstrates the large improvements due 
to the global optimization scheme that is only possible after 
reducing the set of variables to the few causal predictors. 

The aforementioned results have been obtained using cross- 
validation to select the optimal p. The computationally 
cheaper alternative using a heuristic criterion here yields dras¬ 
tic differences in the prediction performance depending on the 
choice of A. While here values in the range A = 20% ... 30% 
give good results, in another experiment (Fig. we found 
good predictions only for A = 10%. ..15% making it hard 
to provide rules of thumb in practical applications. In the ap¬ 
pendix we show that also the length of the time series results 
in different optimal ranges for A. On the other hand, for the 
optimal scheme the heuristic choice leads to almost the same 
minimal errors as in cross-validation. 

To test the robustness of our results, we also compare the 
prediction schemes on a class of non-synergetic, but still non- 
linearly coupled models (generalized additive models ca 
with = 4 processes and polynomials of linear and quadratic 
degree) as analyzed in the supplement of Ref. El For each 
ensemble member, we choose as a target variable the one with 
the largest sum of ‘incoming’ coefficients (absolute values). 
The results shown in Fig. [^demonstrate that for this case also 
the causal CMI-selection scheme reaches optimal prediction 
errors. In the appendix, we show that the optimality is robust 
also for different significance thresholds and other time series 
lengths. 

In Figs.j^c) and Fig.j^c) we evaluate the prediction for dif¬ 
ferent phase space resolutions. To this end we use the causal 
predictors and run steps (ii) and (iii) of the prediction scheme 
from Fig. (using cross-validation to choose ^ for varying 
nearest-neighbor parameters k and MMI estimation parame¬ 
ter k^^^ = k, both set to the same value for consistency. 
While in the first ensemble (Fig.j^c)) the error is minimal for 
very few neighbors and sharply rises if too many neighbors 


are used, in the second ensemble (Fig.j^c)) too few neighbors 
yield worse results. In practice, the choice will very much de¬ 
pend on the process under study, but here we use a value in the 
range k = 5 ... 10 which constitutes a balance between local 
information and enough neighbors to reliably estimate 


VIII. MODEL-FREE SELECTION COMBINED WITH 
MODEL-BASED PREDICTION 

Up to now we have stayed in a model-free framework with 
information-theoretic optimal selection of predictors and a 
nearest-neighbor prediction. While nearest-neighbor predic¬ 
tion is a fiexible method that will adapt to any function / in 
Eq. 0, it will in many cases be outperformed by a model- 
based prediction - if the right model class is chosen. If a mis- 
specified model is chosen for variable selection and fitting, it 
might miss out nonlinear combinations of predictors. For ex¬ 
ample, in our synergetic model a linear selection method 
would only include the weakly predictive variables and 
largely miss out the highly predictive variables Z^'\ The func¬ 
tional dependency on the is, on the other hand, much 
better fitted with a linear model than with nearest neighbors. 

To take advantage of improved model-based predictions 
and at the same time not miss out synergetic predictor combi¬ 
nations, we propose to apply our optimal predictor selection 
scheme and conduct the final prediction step by fitting a model 
on the optimal set of predictors. Here we demonstrate this ap¬ 
proach on the non-synergetic model ensemble from Ref. El 
To predict It+ft from the optimal subset of predictors 
(chosen by cross-validation or the heuristic criterion), we use 
the ordinary least squares regression technique. Then the pre¬ 
diction interval is given by the variance d‘1 of the regression 
residual plus the errors in the estimated regression coefficients 

BID: 


%+h = d{%+H) 



i=l 


( 12 ) 


In Fig.j^b) the results for this approach are shown as the or¬ 
ange box plot. The prediction improvement varies strongly for 
the different realizations which include nonlinear and linear 
drivers. About half of the realizations are better fitted using 
the optimized linear approach with prediction improvements 
of up to 10% compared to the nearest-neighbors prediction. 
More advanced techniques such as generalized additive mod¬ 
els can further improve a prediction El In addition to fa¬ 
cilitating the prediction task, the knowledge of the functional 
forms of dependencies can also help to better understand cou¬ 
pling mechanisms. 


IX. PREDICTING ENSO 

The combined framework developed in the last section is 
now illustrated on a sea-surface temperature index of the 
El Nino Southern Oscillation (ENSO) in the tropical Pacific 
which has been the focus of prediction research for many 
decades due to its far-reaching climatic and economic impacts 
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Figure 7. (Color online) Prediction of the El Nino Southern Oscilla¬ 
tion (ENSO) index Nino3.4 in the period 2003-2014 (up to Decem¬ 
ber) using 1951-2002 as a learning set, causal algorithm run with 
significance threshold /* = 0.03 testing up to Tmax = 12 months, 
(a) Prediction error using nearest-neighbor prediction (solid red line) 
and linear prediction (dotted orange line) versus prediction step h. 
Eor both approaches the same optimal predictors obtained from the 
model-free scheme with cross-validated (5-fold within the learning 
set) choice of predictors are used, (b) Nino3.4 index with El Nino and 
La Nina events marked in red and blue, respectively. The black lines 
denote selected hindcasts and their Icr-prediction intervals (grey) us¬ 
ing the linear prediction. The dots mark the starting time t in May of 
each year, and the predicted values range from June (/i = 1) to Octo¬ 
ber {h — 5). The arrows mark correct (red), missed (grey) and false 
(black) hindcasts of the Nino3.4 index exceeding 0.5 °C with more 
than 49% probability. The green line marks a real forecast starting 
in December 2014 giving a probability for the Nino3.4 index to stay 
above 0.5 °C of 55-70% for the months until May 2015. 


fT\\ |32l. The Nino3.4 index is defined as the average sea- 
surface temperature over the region 5°N-5°S, 170°-120°W 
Ea. As another possible predictor variable, we use an atmo¬ 
spheric index based on sea-level pressure, the Southern Os¬ 
cillation Index (SOI), which is computed from the surface air 
pressure difference between Tahiti and Darwin, Australia. 

In Fig. |7ja), we employ the model-free causal algorithm 
and predictor selection (steps (i)-(ii) in Fig. here optimized 
using cross-validation) to obtain the optimal predictors and 
compare the skill of the nearest-neighbor and the linear pre¬ 
diction using the auto-regressive model fitted on the opti¬ 
mal predictors. Trained on the period 1951-2002, we test the 
prediction on the last decade 2003-2014. From the 24 possi¬ 
ble predictors, the optimal predictor for = 1 month is only 
Nino3.4 at one month lag, while for = 2 months the three 
predictors (Nino3.4t, SOIt, SOIt-g) are relevant indicating 
that the atmospheric coupling, including a long memory, con¬ 
stitutes an important predictive mechanism. Here, the linear 
auto-regressive model significantly reduces the prediction er¬ 


ror by about 0.05 — 0.1 compared to the nearest-neighbor ap¬ 
proach using the same predictors, at least for a few months 
ahead. For steps larger than 5 months, the error in both ap¬ 
proaches quickly reaches 1 which implies that the prediction 
is merely a persistence forecast. The better linear prediction is 
a sign that exploiting the nonlinearities in Nino3.4 M does 
not improve the prediction much while the linear fit using the 
optimal predictors better harnesses the linear drivers of ENSO 
- at least on these time scales 135)1 . 

To give an impression of selected predictions from the lin¬ 
ear model (actually hindcasts), we show in Fig. |7jb) the pre¬ 
dictions up to 5 months ahead starting from May in each year. 
The important onsets of El Nino events are determined by ex¬ 
pert assessment, but one definition is the 3-month-running- 
mean smoothed Nino3.4 index exceeding 0.5 °C, here marked 
by a red line (La Ninas, where the index decreases below 
—0.5 °C are marked in blue). With our hindcasts starting in 
May of each year, one can compute the probability of an El 
Nino event as the part of the prediction distribution exceeding 
0.5 °C (assuming a Gaussian distribution with mean and stan¬ 
dard deviation given by Eq. (p^). If this probability is larger 
than 49% for any of the 5 months ahead (until October), we 
predict an El Nino event. With this scheme we would have 
correctly predicted the moderate El Nino event in 2009 and 
the onset of the weak El Nino in current 2014-2015 season 
(red arrows), but missed the weak events of 2004 and 2006 
(grey arrows, the latter being almost predicted with a proba¬ 
bility of 48%). On the other hand, in 2011 (black arrow) a 
false alarm is given. The overall weak predictability of the 
recent El Nino events is also found in other studies using sta¬ 
tistical as well as physical model predictions 1^ and suggests 
that the mechanism of ENSO could be changing. Finally, our 
real forecast (green line) starting from December 2014 sug¬ 
gests that the weak current El Nino condition persists with a 
probability of 55-70% for the months until May 2015. 


X. DISCUSSION AND CONCLUSIONS 


In this article we have shown that the combinatorial explo¬ 
sion to search for globally optimal subsets of predictors can be 
overcome by restricting the search to causal drivers. Globally 
optimal predictors detect also synergetic mechanisms where 
the combination of multiple predictors strongly improves a 
prediction. Analytical considerations and numerical experi¬ 
ments indicate that such an approach is superior to schemes 
using Ml-ranking or forward selection with conditional mu¬ 
tual information. Another advantage is that the computational 
complexity only scales with the number of causal predictors 
and not directly with the number of processes included in the 
analysis. If the set of causal predictors is not that large, the op¬ 
timal scheme is even computationally less expensive than the 
non-causal CMI-selection scheme. To determine the optimal 
size of this set, we have found that a parameter-free heuristic 
criterion performs almost as good as a computationally much 
more demanding cross-validation. 

Note that, even though theoretically only causal drivers 
can yield optimal predictions, non-causal variables could still 
be better predictors. Consider the case that a very high¬ 
dimensional process W drives Y and X. Then the predic¬ 
tion of Y from the causal drivers W is deteriorated due to 
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the curse of dimensionality for finite samples, while the non- 
causal process X could potentially better aggregate this infor¬ 
mation. The same effect also explains why in Fig.fflthe CMI 
prediction using the non-causal ’ together with the syner¬ 
getic drivers has a slightly smaller prediction error than the 
causal CMI-selection scheme for p = 5 (grey box plot). 
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While we propose the model-free selection of predictors for 
processes where the underlying mechanisms are poorly under¬ 
stood, the actual prediction can be much improved using suit¬ 
able model-based techniques compared to a pure model-free 
nearest-neighbor prediction. This approach combines the ad¬ 
vantage of a model-free approach to detect relevant variables 
with the smaller prediction variance of model-based methods 
and can also be used to better understand coupling mecha¬ 
nisms. The application of this combined approach signifi¬ 
cantly improves the prediction of an ENSO index compared 
to a nearest-neighbor scheme. The combined approach can 
be further improved by optimizing the number of predictors p 
with a different criterion than the model-free criteria discussed 
in Sect. IV B[ Especially linear models can harness much more 
predictors before the problem of overfitting becomes severe. 


Here the scope of application was the prediction of fu¬ 
ture values of a time series. In a forthcoming paper we 
will investigate how the scheme can be adapted if, for exam¬ 
ple, only forecasts for the emergence of extreme events like 
El Ninos are needed. A Python script to estimate the 
causal predictors can be obtained from the author’s website at 
WWW.pik-potsdam.de/members/jakrunge. 


APPENDIX 
I. ROBUSTNESS 

In Pig.we show the results as in Pig.j^for different values 
of the significance threshold /*. Obviously this threshold af¬ 
fects the true positive and false discovery rate, which are, how¬ 
ever, not directly of interest for the prediction task (as opposed 
to the causal inference problem). But a too low significance 
level in the causal pre-selection algorithm leads to a high com¬ 
putational complexity and also increases the variance in the 
optimal subset selection step, which results in higher predic¬ 
tion errors. If, on the other hand, the significance level is too 
high, too few predictors are available to optimize the predic¬ 
tion such that the resulting optimal predictors equal the pre¬ 
selected causal predictors Vvt+h- the significance level is 
adjusted to yield just a few predictors more than the number 
of optimal predictors p (obtained through cross-validation or 
the optimal heuristic criterion), the prediction error is mini¬ 
mal and also the computational complexity is lower than for 
the non-causal CMI-forward selection scheme. 

We also evaluate the prediction schemes for time series 
lengths T = 300 and T = 800. The results shown in Eig.|^ 
demonstrate that the optimal scheme also works for very short 
time series and is even better for longer time series. Eor 
T = 800 and the synergetic model ([nj the optimal scheme 
even results in 75% of the realizations reaching the true mini¬ 
mal prediction error. 
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